This suite of tools is compatible with CRISPResso v.2.0.31 through v.2.0.40 and has the following modes:
Note that these tools are incompatible with prime editing.
Download the CRISPResso2_downstream source files from the CRISPResso2 GitHub repository.
The necessary package dependencies are best installed in a conda environment.
Use one of the crispresso_downstream_env.yml files to generate a crispresso_downstream_env environment:
For local use on Mac OSX systems:
conda env create -f crispresso_downstream_env_osx.yml
For use on Unix/Linux systems:
conda env create -f crispresso_downstream_env_non_osx.yml
The environment contains the following dependencies:
python=2.7.15
R>=3.5.1
r-optparse
r-tidyselect
r-tidyverse
r-RColorBrewer
r-grid
r-gtable
r-extrafont
r-scales
r-effsize
r-cowplot
Once the conda environment is generated, activate it with the following command:
conda activate crispresso_downstream_env
To deactivate the conda environment after running the analyses, input the following:
conda deactivate
All commands must be input from within the CRISPResso2_downstream source directory.
Sample off-target CRISPRessoPooled outputs and analysis inputs from Zeng et al. [1] (https://www.nature.com/articles/s41591-020-0790-y?draft=collection) are provided in the Sample_data.zip file. The input sequences were generated using IDT’s rhAMPSeq panels. All sample inputs and outputs provided in this documentation are drawn from the same dataset and analysis. OT1 in Zeng et al. corresponds to 1620_OT_17 in the current analysis, OT2 corresponds to 1620_OT_19, and 1620_OT_1 is the on-target.
[1] Zeng, J., Wu, Y., Ren, C. et al. Therapeutic base editing of human hematopoietic stem cells. Nat Med 26, 535–541 (2020). https://doi.org/10.1038/s41591-020-0790-y
-c, --crisp_out_dir: path to the folder storing the CRISPRessoPooled output directories
-d, --dataID: data identifier common to all names of CRISPRessoPooled input amplicons (the name following “CRISPResso_on_”), regular expressions accepted
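-m, --mode: analysis mode [default collapse_only]. The modes used in the examples in this documentation are collapse_only, collapse_BE, BE_only, collapse_BE_OT, collapse_OT, and OT_only.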
-p, --percent_freq_cutoff: minimum frequency an allele (represented by rows in the collapsed allele tables) must appear in any one CRISPResso run/amplicon (represented by columns in the collapsed allele tables) to be included in the collapsed output table. After applying this filter, the allele frequencies are re-normalized so that all frequencies sum to 100%. [default 0]
--CRISPRessoBatch: for any collapse mode: use CRISPRessoBatch files as inputs [default FALSE]
--CRISPRessoPooled: for any collapse mode: use CRISPRessoPooled files as inputs [default FALSE]
-n, --noSub: do not include substitutions as edits in output table [default FALSE]
-f, --conversion_nuc_from: for any BE mode: the nucleotide targeted by the base editor [default C]
-t, --conversion_nuc_to: for any BE mode: the nucleotide(s) produced by the base editor. If multiple nucleotides, enter all letters without separators (ex. ATG) [default T]
-b, --base_edit_window: for any BE mode: quantification window range (joined by a hyphen) for base editing conversions within the spacer/guide sequence; the first base pair of the guide sequence is 1. For example, if the base editing quantification window is 3-10 and the spacer sequence is TTTATCACAGGCTCCAGGAA, acceptable base edits must fall in the bold part of TT**TATCACAG**GCTCCAGGAA. [default 3-10]
-i, --indel_window: for any BE mode: quantification window range (joined by a hyphen) for acceptable indels within the spacer/guide sequence; the first base pair of the guide sequence is 1. For example, if the indel quantification window is 17-18 and the spacer sequence is TTTATCACAGGCTCCAGGAA, acceptable indels must overlap the bold part of TTTATCACAGGCTCCA**GG**AA. [default 17-18]
-s, --ot_sample_csv: for any OT mode: path to sample csv file (see “Input files” below for details)
-r, --ref_seq_csv: for any OT mode: path to reference and guide sequences csv file (see “Input files” below for details)
-B, --be_summary_exists: for OT_only mode: add this flag if BE summaries already exist (and set -f and -t if they differ from the default values) [default FALSE]
-v, --sort_by_pval: for any OT mode: sort off-targets by t-test p-value instead of off-target names in output figures [default FALSE]
-e, --scale_size_by_editing_freq: for any OT mode: add this flag to separate the points in the OT % editing scatterplot by size according to sample read coverage [default FALSE]
-l, --low_coverage: for any OT mode with the -e flag: the upper read count cutoff for “low-coverage” amplicons/samples [default 1000]
-u, --high_coverage: for any OT mode with the -e flag: the lower read count cutoff for “high-coverage” amplicons/samples [default 10000]
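The 1-based window convention used by -b and -i can be illustrated with a short Python 3 sketch. This is not code from the tool itself; the helper names `parse_window` and `window_bases` are hypothetical, and the sketch only assumes that windows are inclusive and written as `start-end`, as described above.

```python
def parse_window(window):
    """Turn a 'start-end' string such as '3-10' into a (start, end) tuple of ints."""
    start, end = (int(part) for part in window.split("-"))
    if start < 1 or end < start:
        raise ValueError("window must be 'start-end' with 1 <= start <= end")
    return start, end


def window_bases(spacer, window):
    """Return the spacer bases covered by a 1-based, inclusive window."""
    start, end = parse_window(window)
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return spacer[start - 1:end]


spacer = "TTTATCACAGGCTCCAGGAA"
print(window_bases(spacer, "3-10"))   # TATCACAG (default base editing window)
print(window_bases(spacer, "17-18"))  # GG (default indel window)
```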
Follow the CRISPResso2 documentation to run either CRISPRessoBatch or CRISPRessoPooled.
When running CRISPRessoPooled and multiple guides are used for the same amplicon, the full names of all guides must be included and separated by underscores in pooled.txt (ex. when guides/targets 1620_OT_10 and 1620_OT_14 are on the same amplicon, name the run “1620_OT_10_1620_OT_14”).
For easy identification and sorting, it is best to name off-targets guide_OT_#### (ex. 1620_OT_12) beginning with the on-target as “0” (ex. 1620_OT_000).
Set --min_reads_to_use_region lower if low read coverage is expected for any amplicon, as amplicons with fewer aligned reads will not be included in the analysis.
Use --quantification_window_size and --quantification_window_center to set quantification window characteristics, not --quantification_window_coordinates.
All input files (ref_seq_csv and ot_sample_csv) should be in the same directory as the CRISPRessoBatch/CRISPRessoPooled run output files.
File linking CRISPRessoPooled output files to biological samples/conditions necessary for off-target analysis (not necessary for collapse or BE modes). As the off-target visualization tool takes up to 30 samples, this table may have up to 30 rows (not including headers). Input headers exactly as they are shown in the sample table and column descriptions below.
| donor | condition | CRISPResso_dir_name | sample_name | R_color | R_fill | R_shape |
|---|---|---|---|---|---|---|
| Donor_6 | mock | all_merged_EP1116-Mock_S1_mix | Donor_6 mock | firebrick1 | firebrick1 | 1 |
| Donor_6 | edited | all_merged_BE1116-1620-1x_S2_mix | Donor_6 1EP | firebrick1 | firebrick1 | 16 |
| Donor_6 | edited | all_merged_BE1116-1620-2x_S3_mix | Donor_6 2EP | firebrick1 | firebrick1 | 16 |
| Donor_7 | mock | all_merged_BE1215-1-Mock_S4_mix | Donor_7 Mock | cornflowerblue | cornflowerblue | 1 |
| Donor_7 | edited | all_merged_BE1215-2-1620-1x_S5_mix | Donor_7 1EP | cornflowerblue | cornflowerblue | 16 |
| Donor_7 | edited | all_merged_BE1215-3-1620-2x_S6_mix | Donor_7 2EP | cornflowerblue | cornflowerblue | 16 |
| Donor_8 | mock | merged_BE0108-1-Mock_S7_mix | Donor_8 Mock | darkgoldenrod1 | darkgoldenrod1 | 1 |
| Donor_8 | edited | all_merged_BE0108-9-1620-2x_S8_mix | Donor_8 2EP | darkgoldenrod1 | darkgoldenrod1 | 16 |
Necessary columns:
donor = cell donor ID + technical replicate (if applicable)
condition = either “mock” or “edited”
CRISPResso_dir_name = names of CRISPRessoPooled output directories following CRISPRessoPooled_on_
sample_name = sample names to be displayed in off-target output figure (text separated by spaces will be displayed on separate lines in the final figure)
Optional aesthetics columns (see options below):
R_color = R grDevices::colors color names or hex codes indicating the colors differentiating CRISPRessoPooled runs/samples in the off-target editing dotplot. If the column does not exist, default colors are generated.
R_fill = R grDevices::colors color names or hex codes, same as R_color options. By default, R_fill options are the same as R_color options.
R_shape = R ggplot2 shape options. By default, control/mock samples are unfilled circles (1), and edited samples are filled circles (16).
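Before running an OT mode, it can help to check the sample table against the rules above. The following Python 3 sketch is only an illustration and is not part of the tool; the function name `check_sample_csv` is hypothetical, and the checks are limited to the necessary columns, the “mock”/“edited” condition values, and the 30-sample limit.

```python
import csv
import io

REQUIRED_COLUMNS = ["donor", "condition", "CRISPResso_dir_name", "sample_name"]
MAX_SAMPLES = 30  # the off-target visualization takes up to 30 samples


def check_sample_csv(handle):
    """Return a list of problems found in an ot_sample_csv-style table."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = ["missing column: " + name
                for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    rows = list(reader)
    if len(rows) > MAX_SAMPLES:
        problems.append("more than %d samples" % MAX_SAMPLES)
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=2):
        if row["condition"] not in ("mock", "edited"):
            problems.append("line %d: condition must be mock or edited" % line_no)
    return problems


example = io.StringIO(
    "donor,condition,CRISPResso_dir_name,sample_name\n"
    "Donor_6,mock,all_merged_EP1116-Mock_S1_mix,Donor_6 mock\n"
    "Donor_6,treated,all_merged_BE1116-1620-1x_S2_mix,Donor_6 1EP\n"
)
print(check_sample_csv(example))
```

For a real file, the same function can be called with `open(path, newline="")` in place of the in-memory example.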
File containing the reference amplicon sequence, guide sequence, and PAM sequence for each off-target. The table is very similar to the CRISPRessoPooled pooled.txt file and is necessary for off-target analysis (not necessary for collapse or BE modes). Unlike the CRISPRessoPooled pooled.txt file, which allows multiple guides per reference amplicon, the ref_seq_csv requires each guide to be in its own row.
Input headers exactly as they are shown in the sample table and column descriptions below.
| ot_id | amplicon_sequence | aligned_guide_seq | guide_sequence | pam |
|---|---|---|---|---|
| 1620_OT_1 | CAGGTAATAACATAGGCCAG… | TTTATCACAGGCTCCAGGAA | TTTATCACAGGCTCCAGGAA | GGG |
| 1620_OT_10 | GCCCAACCAAATCAATATGA… | TTTGTCACAGTCTTCAGGAA | TTTGTCACAGTCTTCAGGAA | AGG |
| 1620_OT_11 | AGATTTCAAGACAAA… | TTTATCTAATGCTCCAGGAA | TTTATCTAATGCTCCAGGAA | AAG |
| 1620_OT_12 | ACACTCATACCTCCCGTTT… | TGTATGACAGGCTCCGGGAA | TGTATGACAGGCTCCGGGAA | AAG |
| 1620_OT_13 | TCCACAGTCCTTGTACT… | TTGATCACAGGCATCAGGAA | TTGATCACAGGCATCAGGAA | CAG |
| 1620_OT_14 | ACCAGCAGCTGAGAGAAA… | TTGATCTCAGGCACCAGGAA | TTGATCTCAGGCACCAGGAA | CGG |
Input columns:
Each off-target “guide” should be in a separate row regardless of whether several guides are on the same amplicon. With the exception of ot_id and guide_sequence, all sequences must consist of uppercase nucleotides (A, C, G, T) only.
ot_id = off-target names (should match the names in the CRISPRessoPooled pooled.txt file exactly)
amplicon_sequence = from the CRISPRessoPooled pooled.txt file
aligned_guide_seq = 20-bp guide sequence in the CRISPRessoPooled pooled.txt file (no bulge placeholders “-” or lowercase letters)
guide_sequence = guide sequence to be displayed in off-target output figures (can contain bulge “-” placeholders and lowercase letters). The sequences in this column must be unique.
pam = PAM sequence to be displayed in off-target output figures (does not need to be a canonical PAM)
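The column rules above can be checked with a small Python 3 sketch before running an OT mode. It is an illustration only, not code from the tool: the function name `check_ref_seq_csv` is hypothetical, the allowed alphabet is assumed to be A, C, G, T, and the amplicon and off-target names in the example are made up.

```python
import csv
import io

REQUIRED = ["ot_id", "amplicon_sequence", "aligned_guide_seq", "guide_sequence", "pam"]
UPPERCASE_ONLY = ["amplicon_sequence", "aligned_guide_seq", "pam"]


def check_ref_seq_csv(handle):
    """Return a list of problems found in a ref_seq_csv-style table."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = ["missing column: " + name for name in REQUIRED if name not in header]
    if problems:
        return problems
    seen_guides = set()
    for line_no, row in enumerate(reader, start=2):
        for column in UPPERCASE_ONLY:
            # Assumption: only uppercase A, C, G, T are allowed in these columns.
            if not row[column] or set(row[column]) - set("ACGT"):
                problems.append("line %d: %s must be uppercase ACGT" % (line_no, column))
        if len(row["aligned_guide_seq"]) != 20:
            problems.append("line %d: aligned_guide_seq must be 20 bp" % line_no)
        if row["guide_sequence"] in seen_guides:
            problems.append("line %d: duplicate guide_sequence" % line_no)
        seen_guides.add(row["guide_sequence"])
    return problems


# Hypothetical rows: the amplicon below is a placeholder, not a real reference.
example = io.StringIO(
    "ot_id,amplicon_sequence,aligned_guide_seq,guide_sequence,pam\n"
    "example_OT_1,ACGTACGTACGT,TTTATCACAGGCTCCAGGAA,TTTATCACAGGCTCCAGGAA,GGG\n"
    "example_OT_2,ACGTACGTACGT,TTTGTCACAGTCTTCAGGAA,TTTATCACAGGCTCCAGGAA,agg\n"
)
for problem in check_ref_seq_csv(example):
    print(problem)
```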
The Collapse mode uses the Alleles_frequency_table_around_sgRNA_[guideseq].txt output by CRISPResso to summarize each indel in more readable terms (deletion lengths and locations, insertion sequences and locations, substitution conversions and locations) with the first nucleotide of the guide as base pair index 1. Only modifications within the set quantification window are counted.
Example Alleles_frequency_table_around_sgRNA_[guideseq].txt
Unlike in the CRISPResso allele tables, each row is a unique indel, not a unique read. The Collapse mode collapses all reads with the same indel. Each indel must appear at a frequency equal to or greater than the percent_freq_cutoff set by the user in at least one sample; otherwise it is filtered out of the table before all allele frequencies are renormalized to 100%.
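The cutoff and renormalization step described above can be sketched in Python 3 as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: the table is modeled as a dictionary mapping each collapsed indel to a list of percent frequencies (one per sample), and the function name and example values are hypothetical.

```python
def filter_and_renormalize(table, percent_freq_cutoff):
    """Drop indels below the cutoff in every sample, then rescale each sample to 100%."""
    if not table:
        return {}
    # Keep an indel if it reaches the cutoff in at least one sample (column).
    kept = {indel: freqs for indel, freqs in table.items()
            if max(freqs) >= percent_freq_cutoff}
    n_samples = len(next(iter(table.values())))
    totals = [sum(freqs[i] for freqs in kept.values()) for i in range(n_samples)]
    return {
        indel: [100.0 * freq / total if total else 0.0
                for freq, total in zip(freqs, totals)]
        for indel, freqs in kept.items()
    }


# Hypothetical frequencies (%) for two samples.
table = {
    "Unedited": [90.0, 60.0],
    "C6T": [9.5, 39.0],
    "-1 (18)": [0.5, 1.0],
}
result = filter_and_renormalize(table, percent_freq_cutoff=2)
for indel, freqs in result.items():
    print(indel, [round(freq, 2) for freq in freqs])
```

With a cutoff of 2, the `-1 (18)` row is removed because it never reaches 2% in any sample, and the remaining rows are rescaled so that each sample again sums to 100%.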
collapse_only mode
Rscript crispresso_downstream_v2.0.40.R -c /Users/local_Jing_BE/1620/20200711_1620_ONESeq_rhAMPSeq_triplicates -d OT -m collapse_only -p 0 --CRISPRessoPooled
[CRISPResso_RUN_NAME]_collapsed_[percent_freq_cutoff].csv
Table of all collapsed indels across all samples in the CRISPRessoBatch or CRISPRessoPooled run generated from Alleles_frequency_table_around_sgRNA_[guideseq].txt (CRISPResso output).
Columns:
Unedited = logical indicating whether the row contains the frequency of unedited alleles
n_deleted = number of base pairs deleted in deletions overlapping the quantification window
n_inserted = number of base pairs inserted in insertions overlapping the quantification window
n_mutated = number of base pairs modified in substitutions overlapping the quantification window
indel = detailed summary of the indel within the quantification window set in CRISPResso (sorted alphanumerically, notation explained in section below)
*[run]_reads__[GUIDE_SEQ]* = number of reads of the indel (row) within the sample/CRISPResso run (column)
*[run]__[GUIDE_SEQ]* = percent frequency of the indel (row) within the sample/CRISPResso run (column)
Indel notation
In the indel column, base pair indexes are centered at the beginning of the aligned spacer sequence, with the first base pair of the spacer distal to the PAM counted as 1 (see figure below).
The summarized mutation notation in the indel column, which is consistent in all output files generated by this tool, is as follows:
Substitution: (ex. C6T, G10A) the first letter is the reference nucleotide, the middle number is the base pair index, and the last letter is the aligned nucleotide; C6T means that the C at the 6th bp in the spacer was converted to T
Deletion: (ex. -1 (18), -5 (19-23)) the minus indicates a deletion, the number immediately following the minus sign is the deletion length, and the numbers in the parentheses indicate the bp indexes of the deletion; -1 (18) is a minus one deletion at the 18th bp of the spacer sequence
Insertion: (ex. +1 (T 17), +9 (GTTCCAGAG 11-19)) the plus indicates an insertion, the number immediately following the plus sign is the insertion length, the nucleotides inside the parentheses make up the inserted sequence, and the bp index or index range following the sequence shows the location of the insertion; +1 (T 17) is a plus one insertion at the 17th bp of the spacer sequence
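For downstream scripting, the three notations above can be parsed with regular expressions. The Python 3 sketch below is an illustration based only on the examples in this section, not code from the tool; the function name `parse_edit` is hypothetical, and it assumes positive bp indexes and a single edit per string (how multiple edits are joined within one indel cell is not covered here).

```python
import re

SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")
DELETION = re.compile(r"^-(\d+) \((\d+)(?:-(\d+))?\)$")
INSERTION = re.compile(r"^\+(\d+) \(([ACGTN]+) (\d+)(?:-(\d+))?\)$")


def parse_edit(token):
    """Parse one substitution, deletion, or insertion string into a dict."""
    token = token.strip()
    match = SUBSTITUTION.match(token)
    if match:
        return {"type": "substitution", "ref": match.group(1),
                "pos": int(match.group(2)), "alt": match.group(3)}
    match = DELETION.match(token)
    if match:
        start = int(match.group(2))
        # A single index such as (18) means start and end are the same bp.
        end = int(match.group(3)) if match.group(3) else start
        return {"type": "deletion", "length": int(match.group(1)),
                "start": start, "end": end}
    match = INSERTION.match(token)
    if match:
        start = int(match.group(3))
        end = int(match.group(4)) if match.group(4) else start
        return {"type": "insertion", "length": int(match.group(1)),
                "sequence": match.group(2), "start": start, "end": end}
    raise ValueError("unrecognized edit notation: %r" % token)


for example in ["C6T", "-5 (19-23)", "+1 (T 17)"]:
    print(parse_edit(example))
```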
While a single allele may contain multiple substitutions, only one indel (insertion/deletion) is identified within the quantification window, as an insertion and a deletion cannot take place at the same site. Each combination of substitutions and a single indel is counted as an individual allele (and an individual row in the collapsed allele tables).
In addition, some indels extend beyond the window set by the --plot_window_size or --offset_around_cut_to_plot parameter in CRISPResso2. In such a case, the insertion summary only includes the insertion sequence included within the CRISPResso plot window (ex. +56 (AGGCTGAGATAACATGGGGG 11-66) only shows a 20-bp long sequence). When an indel begins upstream of the plot window and ends downstream of the plot window, it is unclear where the indel begins and ends based on the Alleles_frequency_table_around_sgRNA_[guideseq].txt file. In these cases, the indel is assumed to begin at the first base pair (5’ end) of the plot window. Should greater indel resolution be desired, the CRISPResso plot window may be increased and the downstream analysis repeated.
The BE mode filters the collapsed allele tables to count only base edits (target and resulting nucleotides set by the -f and -t parameters) within the base editing window of the guide (set by -b, default bps 3-10) and indels overlapping the indel window (set by -i, default bps 17-18, around the Cas9 cut site). All other edits are counted as “Unedited” in the output table.
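The filtering rule can be expressed as a small predicate. The Python 3 sketch below is one reading of the description above and is not the tool's actual code; the function name `counts_as_edit` and the dictionary layout for an edit are hypothetical, and windows are assumed to be 1-based and inclusive.

```python
def counts_as_edit(edit, nuc_from="C", nuc_to="T",
                   base_edit_window=(3, 10), indel_window=(17, 18)):
    """Return True if an edit would be kept under the BE-mode rule described above."""
    if edit["type"] == "substitution":
        start, end = base_edit_window
        # nuc_to may hold several letters (ex. "ATG"), so test membership.
        return (edit["ref"] == nuc_from
                and edit["alt"] in nuc_to
                and start <= edit["pos"] <= end)
    # Insertions and deletions count when their span overlaps the indel window.
    start, end = indel_window
    return edit["start"] <= end and edit["end"] >= start


edits = [
    {"type": "substitution", "ref": "C", "pos": 6, "alt": "T"},   # C6T
    {"type": "substitution", "ref": "C", "pos": 14, "alt": "T"},  # C14T
    {"type": "deletion", "start": 18, "end": 18},                 # -1 (18)
    {"type": "deletion", "start": 19, "end": 23},                 # -5 (19-23)
]
for edit in edits:
    print(edit, counts_as_edit(edit))
```

Under the default windows, C6T and the deletion at bp 18 are kept, while C14T and the deletion at bps 19-23 fall outside the windows and would be counted as “Unedited”.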
collapse_BE mode
Rscript crispresso_downstream_v2.0.40.R -c /Users/local_Jing_BE/1620/20200619_1620_ONESeq_rhAMPSeq_triplicates -p 0 -d 1620 -m collapse_BE -f C -t T --CRISPRessoPooled -b 2-10 -i 17-18
BE_only mode
Rscript crispresso_downstream_v2.0.40.R -c /Users/local_Jing_BE/1620/20200619_CRISPRessoBatch_1620_OT_111 -p 0 -d Donor -m BE_only -f C -t ATG -b 2-10 -i 17-18
[CRISPResso_RUN_NAME]_BE_summary_[conversion].csv
Filtered allele table. The columns are the same as those of the collapsed allele tables (listed above).
The OT mode was designed for visualizing off-target editing, but it can be repurposed for any pool of amplicons run in CRISPRessoPooled.
The OT mode performs a simple t-test comparing percent editing in edited v. control samples and generates composite figures summarizing off-target editing.
collapse_BE_OT mode
Rscript crispresso_downstream_v2.0.40.R -c /Users/local_Jing_BE/1620/20200619_1620_ONESeq_rhAMPSeq_triplicates -p 0 -d 1620 -m collapse_BE_OT -f C -t ATG -b 2-10 -i 17-18 -s 20200619_1620_rhAMPSeq_samples.csv -r 1620_ONESeq_ref_seqs.csv --CRISPRessoPooled
collapse_OT mode
Rscript crispresso_downstream_v2.0.40.R -c /Users/IND_off_target/2020_CRISPResso2/20200710_DE_1450_rhAMPSeq -d OT -m collapse_OT -p 0 -n -s 202006_DE_1450_rhAMPSeq_samples.csv -r 202006_1450_0000_ref_seqs.csv --CRISPRessoPooled
OT_only mode (BE summary exists)
Rscript crispresso_downstream_v2.0.40.R -c /Users/local_Jing_BE/1620/20200326_1620_ONESeq_rhAMPSeq_CRISPRessoPooled -p 0 -d 1620 -m OT_only -f C -t T -s ../BE_rhAMPSeq_samples.csv -r ../1620_ONESeq_ref_seqs.csv --CRISPRessoPooled -B
OT_only mode
Rscript crispresso_downstream_v2.0.40.R -c /Users/IND_off_target/2020_CRISPResso2/20200710_DE_1450_rhAMPSeq -d OT -m OT_only -p 0 -n -s 202006_DE_1450_rhAMPSeq_samples.csv -r 202006_1450_0000_ref_seqs.csv --CRISPRessoPooled
YYYYMMDD_CRISPResso_OT_editing_summary.csv
A table displaying the total editing frequency per off-target, per sample.
Columns:
off_target = off-target names
guide_sequence = guide/off-target sequence (no PAM) including mismatches and bulges
pam = PAM sequence
sample = sample name
condition = control or edited
aligned_guide_seq = guide/off-target sequence without mismatches/bulges (the sequence used for alignment)
editing_freq = total editing frequency per off-target, per sample
reads = number of edited reads
group = editing condition + sample name (same as in off-target composite figure below)
YYYYMMDD_CRISPResso_OTs_ttest.csv
This table displays summary statistics as well as the results of simple statistical tests comparing edited v. mock samples by off-target. Editing frequency (%) is pooled by off-target for all mock samples and all edited samples, and an independent, one-tailed t-test (95% confidence interval, Welch df) is performed to determine whether editing frequency is higher in edited samples compared to mock samples for each off-target. If fewer than 3 samples are included for either the edited or mock group, no statistical test is performed for the off-target. Likewise, if the variance is equal, no test is performed. In both cases, the p-value is NA in the output table.
In conjunction with the t-test, Cohen’s d (effect size) is also calculated using a 95% confidence interval (not pooled, Hedges correction applied).
Columns:
off_target = off-target names
edited/control_median = median value of edited/control editing frequencies (%) by off-target
edited/control_mean = mean value of edited/control editing frequencies (%) by off-target
edited/control_sd = standard deviation of edited/control editing frequencies (%) by off-target
ttest_p_value = independent, one-tailed t-test (95% confidence interval, Welch df) p-value comparing edited v. mock samples
eff_size = Cohen’s d (effect size) of difference between edited v. mock editing frequencies
significant = logical indicating whether the editing was significant for the off-target (for easy filtering)
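To make the Welch test behind ttest_p_value more concrete, the Python 3 sketch below computes the Welch t statistic and its degrees of freedom for one off-target using only the standard library. It is an illustration, not the tool's code: the function name `welch_t` and the example frequencies are hypothetical, the p-value itself is not computed (that requires the t distribution), and returning None when both groups have zero variance is an assumption of this sketch rather than a documented rule of the tool.

```python
import math
import statistics


def welch_t(edited, mock):
    """Return (t, df) for a Welch t-test of edited v. mock, or None if not testable."""
    # Mirror the rule above: at least 3 samples are needed in each group.
    if len(edited) < 3 or len(mock) < 3:
        return None
    var_e = statistics.variance(edited) / len(edited)
    var_m = statistics.variance(mock) / len(mock)
    if var_e + var_m == 0:
        # Assumption: the statistic is undefined when neither group varies.
        return None
    t = (statistics.mean(edited) - statistics.mean(mock)) / math.sqrt(var_e + var_m)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_e + var_m) ** 2 / (
        var_e ** 2 / (len(edited) - 1) + var_m ** 2 / (len(mock) - 1)
    )
    return t, df


# Hypothetical editing frequencies (%) for one off-target.
print(welch_t([5.0, 6.0, 7.0], [1.0, 1.5, 2.0]))
print(welch_t([5.0, 6.0], [1.0, 1.5, 2.0]))  # too few edited samples
```

A positive t here corresponds to higher editing in edited samples than in mock samples, which is the direction tested by the one-tailed test described above.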
off_target_Rplot.png/pdf
Example figure below. Figures are saved as individual .png files and pages in a single .pdf file.
The heatmap on the left displays the read coverage per amplicon, per sample. The off-target names and sequences in the center are followed by an * when the editing frequency in edited samples is significantly higher than that in mock/control samples. The editing frequency is displayed in the dotplot on the right. If the --sort_by_pval flag is used, the off-targets are listed in increasing order of t-test p-value (where statistical tests are possible).
off_target_summary_Rplot.png/pdf
Example figure below. Figures are saved as individual .png files and pages in a single .pdf file.
The heatmap on the left is the same as that of the (non-summary) figures above. The editing frequency summary statistics are displayed in the dotplot on the right; all mock samples are pooled, and all edited samples are pooled.